# Remember and Forget for Experience Replay (ReF-ER)

## 1 Overview

Remember and Forget for Experience Replay (ReF-ER) dismisses transitions classified as “far-policy”. The key metric, the importance weight ($$\rho_t$$), is the ratio between the probability of selecting action ($$a_t$$) under the current policy ($$\pi^w$$) and under the behavior policy ($$\mu_t$$): $$\rho_t = \pi^w(a_t \mid s_t) / \mu_t(a_t \mid s_t)$$.

If $$1/c_{\text{max}} < \rho_t < c_{\text{max}}$$, the transition is classified as “near-policy”; otherwise it is “far-policy”. Gradient estimates ($$\hat{g}(w)$$) computed from far-policy transitions are clipped to 0.
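
A minimal sketch of this classification and clipping (our illustration with plain NumPy arrays, not the authors' implementation):

```python
import numpy as np

def importance_weight(pi_prob, mu_prob):
    """rho_t = pi^w(a_t|s_t) / mu_t(a_t|s_t)."""
    return pi_prob / mu_prob

def is_near_policy(rho, c_max):
    """Near-policy iff 1/c_max < rho < c_max."""
    return (1.0 / c_max) < rho < c_max

def clip_far_policy_gradient(g, rho, c_max):
    """Zero the gradient estimate for far-policy samples."""
    return g if is_near_policy(rho, c_max) else np.zeros_like(g)
```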

Additionally, a penalty term ($$\hat{g}^D(w)$$) is defined as follows: $$\hat{g}^D(w) = E[\nabla D_{\text{KL}}(\mu_k(\cdot \mid s_k) \| \pi^w(\cdot \mid s_k))]$$.
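
For concreteness, here is a sketch of the KL divergence inside the expectation, assuming diagonal Gaussian policies (an assumption on the policy class); the gradient $$\nabla D_{\text{KL}}$$ with respect to $$w$$ would typically be obtained by backpropagating through $$\pi^w$$:

```python
import numpy as np

def kl_diag_gauss(mu_mean, mu_std, pi_mean, pi_std):
    """D_KL( mu_k(.|s_k) || pi^w(.|s_k) ) for diagonal Gaussians."""
    var_ratio = (mu_std / pi_std) ** 2
    mean_term = ((pi_mean - mu_mean) / pi_std) ** 2
    return 0.5 * np.sum(var_ratio + mean_term - 1.0 - np.log(var_ratio))
```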

These two terms are combined using an annealing parameter $$\beta$$:

$\hat{g}^{\text{ReF-ER}}(w) = \beta \hat{g}(w) + (1-\beta) \hat{g}^D(w)$
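
As a sketch, assuming `g` and `g_D` are same-shaped gradient arrays:

```python
def refer_gradient(g, g_D, beta):
    """beta * g_hat(w) + (1 - beta) * g_hat^D(w)."""
    return beta * g + (1.0 - beta) * g_D
```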

$$\beta$$ is updated at each step by the following rule:

$\beta \leftarrow \begin{cases} (1-\eta)\beta & \text{if } n_{\text{far}}/N > D \\ (1-\eta)\beta + \eta & \text{otherwise} \end{cases}$

where $$\eta$$ is the learning rate of the neural network, $$N$$ is the total number of samples in the replay buffer, $$n_{\text{far}}$$ is the number of far-policy samples in the replay buffer, and $$D$$ is a hyperparameter.
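
The update rule translates directly into code; a sketch, with `n_far`, `N`, and `D` as defined above:

```python
def update_beta(beta, eta, n_far, N, D):
    """Anneal beta toward 0 while too many buffer samples are far-policy."""
    if n_far / N > D:
        return (1.0 - eta) * beta
    return (1.0 - eta) * beta + eta
```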

## 2 With cpprb

Under investigation. It is not yet clear how the authors sample from the replay buffer; we are continuing to investigate their code.
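
One possible approach (our assumption, not confirmed from the authors' code) is to store the behavior policy's probability $$\mu_t(a_t \mid s_t)$$ alongside each transition, so that $$\rho_t$$ can be recomputed against the current policy at sampling time:

```python
import numpy as np
from cpprb import ReplayBuffer

obs_dim, act_dim = 3, 1
rb = ReplayBuffer(256,
                  env_dict={"obs": {"shape": obs_dim},
                            "act": {"shape": act_dim},
                            "rew": {},
                            "next_obs": {"shape": obs_dim},
                            "done": {},
                            "mu_prob": {}})  # mu_t(a_t|s_t) at collection time

# During rollouts, record the behavior probability together with the transition.
rb.add(obs=np.zeros(obs_dim), act=np.zeros(act_dim), rew=0.0,
       next_obs=np.zeros(obs_dim), done=0.0, mu_prob=0.5)

sample = rb.sample(32)
# rho = pi_prob(sample["obs"], sample["act"]) / sample["mu_prob"]
# where pi_prob is a hypothetical function evaluating the current policy.
```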